Pandas has proven very successful as a tool for working with time series data, especially in the financial data analysis space.
In working with time series data, we will frequently seek to:
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)
We'll be using the MovieLens dataset in many examples going forward. The dataset contains 100,000 ratings made by 943 users on 1,682 movies.
In [2]:
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
df_users = pd.read_csv('data/MovieLens-100k/u.user', sep='|', names=u_cols)
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
df_ratings = pd.read_csv('data/MovieLens-100k/u.data', sep='\t', names=r_cols)
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
df_movies = pd.read_csv('data/MovieLens-100k/u.item', sep='|', names=m_cols, usecols=range(5))# only load the first five columns